Automatic Discovery of Semantic Structures in HTML Documents
Authors
Abstract
Template-driven HTML documents possess an implicit, fixed schema denoting concepts and their relationships in a hierarchical fashion. Discovering this schema remains a relatively unexplored problem. By exploiting the key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm for automatically partitioning them into tree-like semantic structures which expose the implicit schema.
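The abstract does not spell out the algorithm, so the following is only a minimal sketch of the spatial-locality idea, not the authors' method. It assumes well-formed HTML and treats adjacent sibling elements with the same structural shape as one group. The names `Node`, `TreeBuilder`, `signature`, `partition`, and `parse_tree` are hypothetical and introduced purely for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones DOM node (hypothetical helper for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes the input HTML is well-formed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural shape of a subtree: its tag plus the shapes of its children."""
    return (node.tag, tuple(signature(child) for child in node.children))


def partition(node):
    """Group adjacent, same-shaped siblings at every level of the tree."""
    runs = [list(run) for _, run in groupby(node.children, key=signature)]
    return {
        "tag": node.tag,
        "text": node.text,
        "groups": [[partition(child) for child in run] for run in runs],
    }


def parse_tree(html):
    """Parse an HTML string and return its partitioned structure."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return partition(builder.root)


sample = ("<div><h2>Books</h2><ul><li>A</li><li>B</li></ul>"
          "<h2>Films</h2><ul><li>C</li></ul></div>")
tree = parse_tree(sample)
```

In this sketch the two adjacent `li` items under the first `ul` fall into a single group, while each heading and list under the `div` forms its own group. A real system would need a more tolerant similarity measure than exact shape equality.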
Similar resources
Transforming Arbitrary Tables into F-Logic Frames with TARTAR
The tremendous success of the World Wide Web is countervailed by the effort needed to search for and find relevant information. For tabular structures embedded in HTML documents, typical keyword- or link-analysis-based search fails. The Semantic Web relies on annotating resources such as documents by means of ontologies and aims to overcome the bottleneck of finding relevant information. Turning the cur...
Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis
Although RDF/XML has been widely recognized as the standard vehicle for representing semantic information on the Web, an enormous amount of semantic data is still being encoded in HTML documents that are designed primarily for human consumption and not directly amenable to machine processing. This paper seeks to bridge this semantic gap by addressing the fundamental problem of automatically ann...
Reverse Engineering for Web Data: From Visual to Semantic Structures
Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building up a huge amount of "legacy" data. In order to facilitate querying Web-based data in a way more efficient and effective than just keyword-based retrieval, enriching such Web documents with both structure and semantics is necessary. This paper describes...
A Formal Ontology Discovery from Web Documents
The huge number of documents distributed over the WWW can be regarded as an easily accessible resource of domain-specific knowledge. However, users can also be overwhelmed by the sheer quantity, irregular quality, and unfamiliar content of these documents, which arise from the easy accessibility of specific domains and the unstructured nature of the WWW. One of the possible solutions to this problem ...
Reverse Engineering for Web Data: From Visual to Semantic Structure
Despite the advancement of XML, the majority of documents on the Web are still marked up with HTML for visual rendering purposes only, thus building up a huge amount of "legacy" data. In order to facilitate querying Web-based data in a way more efficient and effective than just keyword-based retrieval, enriching such Web documents with both structure and semantics is necessary. This paper describes...